For project 2 we will be working with the same baseball data from project 1. The data set contains lots of variables, but we will only be playing with a handful of them: Hits, Salary, OBP, Slugging, Walks, Home Runs, W-L% (which we turn into a binary categorical variable called “Winning”, where W-L% >= .500 counts as a “Win” season), and Games Played. Variables cover all 30 MLB teams in 2020 and 2021, for a total of 60 observations with no NAs. All counting variables are averaged per game because those two seasons had a different number of games (that is to say, every time I say “Walks” on this page, I mean “Walks per game”). All ranking and batting data is from Baseball-Reference.com and salary data is from spotrac.com. The code below, much of it taken from project 1, loads our data and selects the variables we’re interested in.
library(tidyverse)
# read your datasets in here, e.g., with read_csv()
payroll2020 <- read_csv("2020payroll.csv")
payroll2021 <- read_csv("2021payroll.csv")
rank2020 <- read_csv("2020rankings.csv")
rank2021 <- read_csv("2021rankings.csv")
batting2021 <- read_csv("2021batting.csv")
batting2020 <- read_csv("2020batting.csv")
# if your dataset needs tidying, do so here
payroll <- payroll2020 %>% mutate(Year = 2020) %>% bind_rows(payroll2021 %>%
mutate(Year = 2021))
rank <- rank2020 %>% mutate(Year = 2020) %>% slice(1:30) %>%
bind_rows(rank2021 %>% mutate(Year = 2021) %>% slice(1:30))
batting <- batting2020 %>% mutate(Year = 2020) %>% slice(1:30) %>%
bind_rows(batting2021 %>% mutate(Year = 2021) %>% slice(1:30))
full <- rank %>% full_join(batting, by = c("Tm", "Year")) %>%
full_join(payroll, by = c(Tm = "Team", "Year"))
df <- full %>% select(Tm, Year, Ratio = `W-L%`, Hits = H, Runs = R.y,
Salary = Total, OnBasePercent = OBP, Slugging = SLG, Walks = BB,
HR, BasesStolen = SB, G)
df <- df %>% mutate(Salary = as.numeric(str_remove_all(Salary,
"[$,]")), Winning = ifelse(Ratio < 0.5, "Lose", "Win"),
YearTeam = paste(Year, Tm, sep = " "), Hits = Hits/G,
Runs = Runs/G, Walks = Walks/G, HR = HR/G, BasesStolen = BasesStolen/G,
Salary = Salary/G)
df <- df %>% select(YearTeam, Winning, everything(), -Tm, -Year)
df %>% head
## # A tibble: 6 x 12
## YearTeam Winning Ratio Hits Runs Salary OnBasePercent Slugging Walks HR
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 Lo… Win 0.717 8.72 5.82 1.80e6 0.338 0.483 3.8 1.97
## 2 2020 Ta… Win 0.667 7.83 4.82 4.72e5 0.328 0.425 4.05 1.33
## 3 2020 Sa… Win 0.617 8.43 5.42 1.22e6 0.333 0.466 3.4 1.58
## 4 2020 Mi… Win 0.6 7.8 4.48 9.28e5 0.315 0.427 3.1 1.52
## 5 2020 Oa… Win 0.6 7.17 4.57 6.12e5 0.322 0.396 3.97 1.18
## 6 2020 At… Win 0.583 9.27 5.8 1.06e6 0.349 0.483 3.98 1.72
## # … with 2 more variables: BasesStolen <dbl>, G <dbl>
library(cluster)
pam_df <- df %>% select(Hits, Salary, OnBasePercent, Walks, HR)
# clustering code here
sil_width <- vector()
for (i in 2:20) {
pam_fit <- pam(pam_df, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width # average silhouette width for k = i
}
# sil_width[1] is unset (k = 1 has no silhouette), so plot from k = 2
ggplot() + geom_line(aes(x = 2:20, y = sil_width[2:20])) + scale_x_continuous(name = "k",
breaks = 2:20)
pam <- pam(pam_df, 8)
pam
## Medoids:
## ID Hits Salary OnBasePercent Walks HR
## [1,] 1 8.716667 1798623.3 0.338 3.800000 1.9666667
## [2,] 59 8.006173 547734.1 0.309 3.314815 0.8888889
## [3,] 49 7.666667 1233266.6 0.314 3.055556 1.0864198
## [4,] 56 8.567901 897473.9 0.337 3.537037 1.1234568
## [5,] 47 8.055556 1100396.7 0.321 3.617284 1.1111111
## [6,] 53 8.092593 725515.7 0.314 3.240741 1.4074074
## [7,] 27 9.200000 1403506.5 0.330 3.116667 1.3500000
## [8,] 55 7.679012 358999.4 0.298 2.777778 0.9753086
## Clustering vector:
## [1] 1 2 3 4 2 5 4 6 7 1 4 3 2 4 7 3 6 3 4 5 5 2 7 3 8 5 7 6 5 8 5 1 8 3 2 4 5 3
## [39] 4 5 2 4 2 6 5 8 5 2 3 5 6 2 6 4 8 4 8 2 2 8
## Objective function:
## build swap
## 31332.42 31042.54
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
pam$silinfo$avg.width
## [1] 0.6733102
library(GGally)
pam_df %>% mutate(cluster = as.factor(pam$clustering)) %>% ggpairs(aes(color = cluster),
upper = list(continuous = wrap("cor", size = 1.4)))
Our average silhouette width, interestingly, is maximized at k = 8 (about 0.67, which by the usual rule of thumb indicates a reasonable structure). Those 8 clusters are driven almost entirely by the Salary variable: the clusters’ Salary ranges have zero overlap. Because Salary is positively correlated with every other variable, the clusters also show some looser separation in the other variables (the higher-salary clusters tend to also have more Walks, for example).
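One quick way to back up the no-overlap claim is to tabulate each cluster’s Salary range (a sketch, assuming the `pam_df` and `pam` objects from above):

```r
# Per-cluster Salary ranges; if each cluster's max stays below the next
# cluster's min, the clusters are fully separated on Salary
pam_df %>%
    mutate(cluster = pam$clustering) %>%
    group_by(cluster) %>%
    summarize(min_salary = min(Salary), max_salary = max(Salary)) %>%
    arrange(min_salary)
```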
# PCA code here
pca1 <- princomp(pam_df, cor = T)
pca1
## Call:
## princomp(x = pam_df, cor = T)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.6641225 1.0598027 0.7651273 0.6819190 0.2389167
##
## 5 variables and 60 observations.
summary(pca1, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6641225 1.0598027 0.7651273 0.6819190 0.23891672
## Proportion of Variance 0.5538608 0.2246363 0.1170840 0.0930027 0.01141624
## Cumulative Proportion 0.5538608 0.7784971 0.8955811 0.9885838 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Hits 0.364 0.718 0.137 0.244 0.523
## Salary 0.434 -0.122 -0.888
## OnBasePercent 0.560 0.200 0.416 -0.687
## Walks 0.369 -0.683 0.329 0.220 0.491
## HR 0.479 0.213 -0.848
library(plotly)
library(stringi)
YearTeam <- paste(full$Year, full$Tm, sep = "\n")
pca1$scores %>% as.data.frame %>% mutate(YearTeam = stri_replace_last_fixed(YearTeam,
" ", "\n")) %>% plot_ly(x = ~Comp.1, y = ~Comp.2, z = ~Comp.3,
color = ~YearTeam, type = "scatter3d", mode = "markers") %>%
layout(showlegend = FALSE)
PC1 is positively associated with all 5 variables; teams with higher PC1 scores tend to score higher in every variable. PC2 contrasts a higher number of hits against a lower number of walks, and PC3 is associated with higher numbers in every variable except Salary, with which it has a strongly negative association. A team like the 2020 Atlanta Braves, which has a high PC1 and a high PC3, might be perceived as having great stats in spite of their smaller total Salary. Together these three PCs account for 89.6% of the variance in the data.
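That variance figure can be recovered directly from the component standard deviations (a sketch, assuming the `pca1` object from above):

```r
# Proportion of total variance captured by each component, plus the running
# total; the third entry is the share covered by PCs 1-3 together
var_prop <- pca1$sdev^2 / sum(pca1$sdev^2)
round(cumsum(var_prop), 3)  # third entry ~0.896, matching the summary above
```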
# linear classifier code here
class_df <- df %>% select(-YearTeam, -G, -Ratio) %>% mutate(Winning = as.factor(Winning))
class_df
## # A tibble: 60 x 9
## Winning Hits Runs Salary OnBasePercent Slugging Walks HR BasesStolen
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Win 8.72 5.82 1798623. 0.338 0.483 3.8 1.97 0.483
## 2 Win 7.83 4.82 471511. 0.328 0.425 4.05 1.33 0.8
## 3 Win 8.43 5.42 1218299. 0.333 0.466 3.4 1.58 0.917
## 4 Win 7.8 4.48 927995. 0.315 0.427 3.1 1.52 0.233
## 5 Win 7.17 4.57 612003. 0.322 0.396 3.97 1.18 0.433
## 6 Win 9.27 5.8 1059366. 0.349 0.483 3.98 1.72 0.383
## 7 Win 8.9 5.1 894421. 0.326 0.453 2.98 1.6 0.333
## 8 Win 7.43 4.13 654985. 0.317 0.372 3.98 0.983 0.417
## 9 Win 7.03 4.42 1443270. 0.318 0.387 3.82 1.23 0.4
## 10 Win 7.88 5.25 1865651. 0.342 0.447 4.18 1.57 0.45
## # … with 50 more rows
y <- class_df$Winning
fit <- glm(Winning ~ ., data = class_df, family = "binomial")
score <- predict(fit, type = "response")
score %>% round(3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 1.000 0.790 0.925 0.784 0.975 0.995 0.925 0.404 0.973 0.995 0.803 0.633 0.551
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.834 0.305 0.212 0.777 0.935 0.607 0.871 0.021 0.008 0.053 0.092 0.163 0.206
## 27 28 29 30 31 32 33 34 35 36 37 38 39
## 0.035 0.091 0.012 0.054 0.854 0.989 0.995 0.921 0.904 0.811 0.459 0.621 0.800
## 40 41 42 43 44 45 46 47 48 49 50 51 52
## 0.060 0.876 0.904 0.746 0.887 0.360 0.346 0.154 0.063 0.023 0.100 0.132 0.008
## 53 54 55 56 57 58 59 60
## 0.379 0.409 0.017 0.077 0.004 0.011 0.027 0.034
class_diag(score, truth = df$Winning, positive = "Win")
## acc sens spec ppv f1 ba auc
## Metrics 0.8667 0.8621 0.871 0.8621 0.8621 0.8665 0.9255
table(actual = y, predicted = ifelse(score < 0.5, "Lose", "Win"))
## predicted
## actual Lose Win
## Lose 27 4
## Win 4 25
# cross-validation of linear classifier here
library(caret)
set.seed(1234)
cv <- trainControl(method = "cv", number = 5, classProbs = T,
savePredictions = T)
fit <- train(Winning ~ ., data = class_df, trControl = cv, method = "glm")
class_diag(fit$pred$pred %>% as.character, fit$pred$obs %>% as.character,
positive = "Win")
## acc sens spec ppv f1 ba auc
## Metrics 0.4833 1 0 0.4833 0.6517 0.5 0.7836
Our linear (logistic regression) classifier tried to predict, from Hits, Runs, Salary, OBP, Slugging, Walks, HR, and Bases Stolen, whether a team had a winning season. The model performed well at first glance (.92 AUC), but after cross-validation we see a decrease in performance (.78 AUC), suggesting some overfitting. Still not a terrible fit overall!
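With only 60 observations, a 5-fold estimate is noisy; caret keeps the metrics for each held-out fold, which show how much the estimate bounces around (assuming the `fit` object from the `train()` call above):

```r
# Accuracy and Kappa on each of the 5 held-out folds; a wide spread here
# means the single cross-validated AUC should be taken with a grain of salt
fit$resample
```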
# non-parametric classifier code here
knn_fit <- knn3(Winning ~ ., data = class_df, k = 3)
y_hat_knn <- predict(knn_fit, class_df)
class_diag(y_hat_knn[, 2], class_df$Winning, positive = "Win")
## acc sens spec ppv f1 ba auc
## Metrics 0.8667 0.7931 0.9355 0.92 0.8519 0.8643 0.9288
table(actual = y, predicted = ifelse(y_hat_knn[, 2] < 0.5, "Lose",
"Win"))
## predicted
## actual Lose Win
## Lose 29 2
## Win 6 23
# cross-validation of np classifier here
set.seed(1234)
k = 5 #choose number of folds
data <- class_df[sample(nrow(class_df)), ] #randomly order rows
folds <- cut(seq(1:nrow(class_df)), breaks = k, labels = F) #create folds
diags <- NULL
for (i in 1:k) {
## Create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Winning ## Truth labels for fold i
## Train model on training set (all but fold i)
fit <- knn3(Winning ~ ., data = train, k = 3)
## Test model on test set (fold i)
probs <- predict(fit, newdata = test)[, 2]
## Get diagnostics for fold i
diags <- rbind(diags, class_diag(probs, truth, positive = "Win"))
}
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.55 0.33332 0.7381 0.47 NaN 0.53572 0.59438
Our KNN model, attempting the same prediction as the linear model from before, also performs well at first glance (.93 AUC), but becomes pretty hot garbage when subjected to cross-validation (.59 AUC), which suggests heavy overfitting. Between the two, we can reasonably say that the logistic regression was the better prediction model.
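One likely culprit besides overfitting: `knn3` uses raw Euclidean distance, and Salary is on a scale of hundreds of thousands while OBP sits near 0.3, so Salary dominates the distance almost entirely. A sketch of tuning k on standardized features with caret (assuming `class_df` from above; the tuning grid is an arbitrary choice):

```r
set.seed(1234)
knn_cv <- train(Winning ~ ., data = class_df, method = "knn",
    preProcess = c("center", "scale"),           # put all features on one scale
    trControl = trainControl(method = "cv", number = 5),
    tuneGrid = data.frame(k = seq(3, 15, 2)))    # odd k avoids voting ties
knn_cv$bestTune
```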
# regression model code here
reg_df <- df %>% select(Ratio, OnBasePercent, Slugging, Walks)
fit <- lm(Ratio ~ ., data = reg_df) # predict W-L ratio from all other variables
yhat <- predict(fit)
# NB: this squares the *mean* residual, which is ~0 for any least-squares fit;
# the in-sample MSE would be mean((reg_df$Ratio - yhat)^2)
mean(reg_df$Ratio - yhat)^2
## [1] 1.977723e-32
# cross-validation of regression model here
set.seed(1234)
cv <- trainControl(method = "cv", number = 5, savePredictions = T) # classProbs doesn't apply to a regression model
fit <- train(Ratio ~ ., data = reg_df, trControl = cv, method = "lm")
fit$results$RMSE^2
## [1] 0.004406416
Our regression model tries to predict a team’s Win/Loss ratio from their OBP, SLG, and Walks. Note that `mean(reg_df$Ratio - yhat)^2` squares the mean residual, which is essentially zero for any least-squares fit (residuals always average out to zero), so the near-zero value above says nothing about fit quality and can’t be compared to the cross-validated number. The cross-validated MSE of about .004 (an RMSE of roughly .066 in W-L ratio units) is the honest estimate of out-of-sample error.
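The parenthesis placement matters a lot here; a minimal self-contained illustration:

```r
# Toy residuals that sum to zero, as OLS residuals always do
resid <- c(0.05, -0.03, -0.02)
mean(resid)^2  # squares the average residual: ~0 regardless of fit quality
mean(resid^2)  # the actual mean squared error: ~0.00127
```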
library(reticulate)
use_python("/usr/bin/python3")
list_of_teams <- full$Tm
list_of_teams
## [1] "Los Angeles Dodgers" "Tampa Bay Rays" "San Diego Padres"
## [4] "Minnesota Twins" "Oakland Athletics" "Atlanta Braves"
## [7] "Chicago White Sox" "Cleveland Indians" "Chicago Cubs"
## [10] "New York Yankees" "Toronto Blue Jays" "St. Louis Cardinals"
## [13] "Miami Marlins" "Cincinnati Reds" "Houston Astros"
## [16] "San Francisco Giants" "Milwaukee Brewers" "Philadelphia Phillies"
## [19] "Seattle Mariners" "Los Angeles Angels" "Colorado Rockies"
## [22] "Kansas City Royals" "New York Mets" "Washington Nationals"
## [25] "Baltimore Orioles" "Arizona Diamondbacks" "Boston Red Sox"
## [28] "Detroit Tigers" "Texas Rangers" "Pittsburgh Pirates"
## [31] "San Francisco Giants" "Los Angeles Dodgers" "Tampa Bay Rays"
## [34] "Houston Astros" "Milwaukee Brewers" "Chicago White Sox"
## [37] "Boston Red Sox" "New York Yankees" "Toronto Blue Jays"
## [40] "St. Louis Cardinals" "Seattle Mariners" "Atlanta Braves"
## [43] "Oakland Athletics" "Cincinnati Reds" "Philadelphia Phillies"
## [46] "Cleveland Indians" "San Diego Padres" "Detroit Tigers"
## [49] "New York Mets" "Los Angeles Angels" "Colorado Rockies"
## [52] "Kansas City Royals" "Minnesota Twins" "Chicago Cubs"
## [55] "Miami Marlins" "Washington Nationals" "Pittsburgh Pirates"
## [58] "Texas Rangers" "Arizona Diamondbacks" "Baltimore Orioles"
# python code here
list_of_teams = r.list_of_teams
list_of_teams = list(set(list_of_teams))
list_of_teams.sort()
list_of_teams
## ['Arizona Diamondbacks', 'Atlanta Braves', 'Baltimore Orioles', 'Boston Red Sox', 'Chicago Cubs', 'Chicago White Sox', 'Cincinnati Reds', 'Cleveland Indians', 'Colorado Rockies', 'Detroit Tigers', 'Houston Astros', 'Kansas City Royals', 'Los Angeles Angels', 'Los Angeles Dodgers', 'Miami Marlins', 'Milwaukee Brewers', 'Minnesota Twins', 'New York Mets', 'New York Yankees', 'Oakland Athletics', 'Philadelphia Phillies', 'Pittsburgh Pirates', 'San Diego Padres', 'San Francisco Giants', 'Seattle Mariners', 'St. Louis Cardinals', 'Tampa Bay Rays', 'Texas Rangers', 'Toronto Blue Jays', 'Washington Nationals']
for (i in 1:length(py$list_of_teams)) {
print(py$list_of_teams[i])
}
## [1] "Arizona Diamondbacks"
## [1] "Atlanta Braves"
## [1] "Baltimore Orioles"
## [1] "Boston Red Sox"
## [1] "Chicago Cubs"
## [1] "Chicago White Sox"
## [1] "Cincinnati Reds"
## [1] "Cleveland Indians"
## [1] "Colorado Rockies"
## [1] "Detroit Tigers"
## [1] "Houston Astros"
## [1] "Kansas City Royals"
## [1] "Los Angeles Angels"
## [1] "Los Angeles Dodgers"
## [1] "Miami Marlins"
## [1] "Milwaukee Brewers"
## [1] "Minnesota Twins"
## [1] "New York Mets"
## [1] "New York Yankees"
## [1] "Oakland Athletics"
## [1] "Philadelphia Phillies"
## [1] "Pittsburgh Pirates"
## [1] "San Diego Padres"
## [1] "San Francisco Giants"
## [1] "Seattle Mariners"
## [1] "St. Louis Cardinals"
## [1] "Tampa Bay Rays"
## [1] "Texas Rangers"
## [1] "Toronto Blue Jays"
## [1] "Washington Nationals"
To play with Python, we took the list of teams from our R data.frame, which contained each team’s name twice (once for 2020 and once for 2021), and imported it into Python. In Python, we got rid of duplicates using a set and sorted the teams into alphabetical order with .sort() before sending them back to R and printing them all out.
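For comparison, the same dedup-and-sort is a one-liner in base R (self-contained toy example):

```r
# unique() drops duplicates and sort() alphabetizes, mirroring
# Python's list(set(x)) followed by .sort()
teams <- c("Tampa Bay Rays", "Chicago Cubs", "Tampa Bay Rays", "Chicago Cubs")
sort(unique(teams))  # "Chicago Cubs" "Tampa Bay Rays"
```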